 expression pattern


Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

Sousa, Lisa Barros de Andrade e, Miller, Gregor, Gleut, Ronan Le, Thalmeier, Dominik, Pelin, Helena, Piraud, Marie

arXiv.org Artificial Intelligence

As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
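The idea of grouping instances by shared decision paths can be illustrated with a small sketch. It assumes, since the abstract does not specify this, that similarity is measured as the fraction of trees in which two samples land in the same leaf (the usual Random Forest proximity), and it uses thresholded connected components as a simple stand-in for a proper clustering step. The names `proximity_matrix`, `group_by_proximity`, and the toy `leaf_ids` are hypothetical, not part of any FGC API.

```python
from itertools import combinations


def proximity_matrix(leaf_ids):
    """Return pairwise proximities from per-tree leaf assignments.

    leaf_ids[i][t] is the leaf index reached by sample i in tree t
    (hypothetical input; in practice this would come from a trained forest).
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Count the trees in which both samples end in the same leaf.
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def group_by_proximity(prox, threshold=0.5):
    """Label connected components of the graph with edges prox >= threshold.

    This is an illustrative stand-in for clustering, not the FGC procedure.
    """
    n = len(prox)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and prox[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy leaf assignments for 4 samples in a 3-tree forest (made-up values).
leaf_ids = [
    [0, 2, 1],
    [0, 2, 3],
    [4, 5, 6],
    [4, 5, 1],
]
prox = proximity_matrix(leaf_ids)
labels = group_by_proximity(prox, threshold=0.5)
print(labels)
```

In this toy input, samples 0 and 1 share two of three leaves, as do samples 2 and 3, so the sketch separates them into two groups.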


Brain-wide interpolation and conditioning of gene expression in the human brain using Implicit Neural Representations

Yu, Xizheng, Torok, Justin, Pandya, Sneha, Pal, Sourav, Singh, Vikas, Raj, Ashish

arXiv.org Artificial Intelligence

In this paper, we study the efficacy and utility of recent advances in non-local, non-linear image interpolation and extrapolation algorithms, specifically ideas based on Implicit Neural Representations (INR), as a tool for the analysis of spatial transcriptomics data. We seek to use microarray gene expression data sparsely sampled in the healthy human brain to produce fully resolved spatial maps of any given gene across the whole brain at voxel-level resolution. To do so, we first selected the 100 top Alzheimer's disease (AD) risk genes, whose baseline spatial transcriptional profiles were obtained from the Allen Human Brain Atlas (AHBA). We adapted Implicit Neural Representation models so that the pipeline can produce robust voxel-resolution quantitative maps of all genes. We present a variety of experiments using interpolations obtained from Abagen as a baseline/reference.


High-Resolution Spatial Transcriptomics from Histology Images using HisToSGE

Shi, Zhiceng, Xue, Shuailin, Zhu, Fangfang, Min, Wenwen

arXiv.org Artificial Intelligence

Spatial transcriptomics (ST) is a groundbreaking genomic technology that enables spatial localization analysis of gene expression within tissue sections. However, it is significantly limited by high costs and sparse spatial resolution. An alternative, more cost-effective strategy is to use deep learning methods to predict high-density gene expression profiles from histological images. However, existing methods struggle to capture rich image features effectively or rely on low-dimensional positional coordinates, making it difficult to accurately predict high-resolution gene expression profiles. To address these limitations, we developed HisToSGE, a method that employs a Pathology Image Large Model (PILM) to extract rich image features from histological images and utilizes a feature learning module to robustly generate high-resolution gene expression profiles. We evaluated HisToSGE on four ST datasets, comparing its performance with five state-of-the-art baseline methods. The results demonstrate that HisToSGE excels in generating high-resolution gene expression profiles and performing downstream tasks such as spatial domain identification. All code and public datasets used in this paper are available at https://github.com/wenwenmin/HisToSGE and https://zenodo.org/records/12792163.


evolSOM: an R Package for evolutionary conservation analysis with SOMs

Prochetto, Santiago, Reinheimer, Renata, Stegmayer, Georgina

arXiv.org Artificial Intelligence

Motivation: Unraveling the connection between genes and traits is crucial for solving many biological puzzles. Genes provide instructions for building cellular machinery, directing the processes that sustain life. RNA molecules and proteins, derived from these genetic instructions, play crucial roles in shaping cell structures, influencing reactions, and guiding behavior. This fundamental biological principle links genetic makeup to observable traits, but integrating and extracting meaningful relationships from this complex, multimodal data presents a significant challenge. Results: We introduce evolSOM, a novel R package that utilizes Self-Organizing Maps (SOMs) to explore and visualize the conservation of biological variables, easing the integration of phenotypic and genotypic attributes. By constructing species-specific or condition-specific SOMs that capture non-redundant patterns, evolSOM allows the analysis of displacement of biological variables between species or conditions. Variables displaced together suggest membership in the same regulatory network, and the nature of the displacement may hold biological significance. The package automatically calculates and graphically presents these displacements, enabling efficient comparison and revealing conserved and displaced variables. The package facilitates the integration of diverse phenotypic data types, enabling the exploration of potential gene drivers underlying observed phenotypic changes. Its user-friendly interface and visualization capabilities enhance the accessibility of complex network analyses. Illustratively, we employed evolSOM to study the displacement of genes and phenotypic traits, successfully identifying potential drivers of phenotypic differentiation in grass leaves. Availability: The package is open-source and is available at https://github.com/sanprochetto/evolSOM.
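The displacement analysis described above can be sketched roughly as follows. This is an illustration in Python rather than the package's R interface, and it rests on stated assumptions: each variable is represented only by the grid position of its best-matching unit on two already-trained SOMs, and a variable counts as conserved if it moves at most `max_shift` cells. The function names, the `max_shift` rule, and the toy coordinates are hypothetical and are not taken from evolSOM.

```python
from collections import defaultdict


def displacements(bmu_a, bmu_b):
    """Return the (row, col) shift of each variable between two maps.

    bmu_a and bmu_b map a variable name to the grid position of its
    best-matching unit on the reference and test SOM, respectively.
    """
    shifts = {}
    for var in bmu_a.keys() & bmu_b.keys():
        (r1, c1), (r2, c2) = bmu_a[var], bmu_b[var]
        shifts[var] = (r2 - r1, c2 - c1)
    return shifts


def split_conserved(shifts, max_shift=1):
    """Split variables into conserved and displaced sets.

    Assumption: staying within max_shift cells counts as conserved.
    """
    conserved = sorted(
        v for v, (dr, dc) in shifts.items() if max(abs(dr), abs(dc)) <= max_shift
    )
    displaced = sorted(v for v in shifts if v not in conserved)
    return conserved, displaced


def displaced_together(shifts, displaced):
    """Group displaced variables that share the same shift vector."""
    groups = defaultdict(list)
    for var in displaced:
        groups[shifts[var]].append(var)
    return {shift: sorted(names) for shift, names in groups.items()}


# Made-up best-matching-unit positions on two SOM grids.
bmu_a = {"geneA": (0, 0), "geneB": (0, 1), "traitX": (3, 3)}
bmu_b = {"geneA": (0, 1), "geneB": (2, 3), "traitX": (5, 5)}

shifts = displacements(bmu_a, bmu_b)
conserved, displaced = split_conserved(shifts)
groups = displaced_together(shifts, displaced)
print(conserved, displaced, groups)
```

Here `geneB` and `traitX` move by the same vector, which under the reading above would flag them as candidates for a shared regulatory network.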


Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data

Payne, Andrea, Silva, Anjali, Rothstein, Steven J., McNicholas, Paul D., Subedi, Sanjeena

arXiv.org Machine Learning

A mixture of multivariate Poisson-log normal factor analyzers is introduced by imposing constraints on the covariance matrix, resulting in flexible models for clustering. In particular, a class of eight parsimonious mixture models based on the mixtures of factor analyzers model is introduced. Variational Gaussian approximation is used for parameter estimation, and information criteria are used for model selection. The proposed models are explored in the context of clustering discrete data arising from RNA sequencing studies. Using real and simulated data, the models are shown to give favourable clustering performance. The GitHub R package for this work is available at https://github.com/anjalisilva/mixMPLNFA and is released under the open-source MIT license.
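As an illustration of model selection by an information criterion, the sketch below scores candidate fits with the Bayesian information criterion in its lower-is-better form, BIC = -2 log L + k log n. The abstract does not say which criteria are used or how they are parameterized, so the choice of BIC, the function names, and the log-likelihood values are assumptions made purely for the example.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion, lower-is-better convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_model(candidates, n_obs):
    """Pick the candidate with the smallest BIC.

    candidates maps a model name to (maximized log-likelihood, free parameters).
    """
    scored = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
    best = min(scored, key=scored.get)
    return best, scored


# Made-up fits for mixtures with 1, 2, and 3 components (illustrative only).
candidates = {
    "G=1": (-1250.0, 10),
    "G=2": (-1100.0, 21),
    "G=3": (-1095.0, 32),
}
best, scores = select_model(candidates, n_obs=200)
print(best)
```

With these numbers, the third component improves the likelihood too little to offset its extra parameters, so the two-component model is preferred.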


Machine learning method improves cell identity understanding

#artificialintelligence

When genes are activated and expressed, they show patterns in cells that are similar in type and function across tissues and organs. Discovering these patterns improves our understanding of cells--which has implications for unveiling disease mechanisms. The advent of spatial transcriptomics technologies has allowed researchers to observe gene expression in their spatial context across entire tissue samples. But new computational methods are needed to make sense of this data and help identify and understand these gene expression patterns. A research team led by Jian Ma, the Ray and Stephanie Lane Professor of Computational Biology in Carnegie Mellon University's School of Computer Science, has developed a machine learning tool to fill this gap. Their paper on the method, called SPICEMIX, appeared as the cover story in the most recent issue of Nature Genetics.


Zero-shot stance detection based on cross-domain feature enhancement by contrastive learning

Zhao, Xuechen, Zou, Jiaying, Zhang, Zhong, Xie, Feng, Zhou, Bin, Tian, Lei

arXiv.org Artificial Intelligence

Zero-shot stance detection is challenging because it requires detecting the stance of previously unseen targets in the inference phase. The ability to learn transferable target-invariant features is critical for zero-shot stance detection. In this work, we propose a stance detection approach that can efficiently adapt to unseen targets, the core of which is to capture target-invariant syntactic expression patterns as transferable knowledge. Specifically, we first augment the data by masking the topic words of sentences, and then feed the augmented data to an unsupervised contrastive learning module to capture transferable features. Then, to fit a specific target, we encode the raw texts as target-specific features. Finally, we adopt an attention mechanism, which combines syntactic expression patterns with target-specific features to obtain enhanced features for predicting previously unseen targets. Experiments demonstrate that our model outperforms competitive baselines on four benchmark datasets.
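The augmentation step, masking the topic words of a sentence, could look roughly like the following. This is a minimal sketch under stated assumptions: the tokenizer is a simple regular expression, topic words are supplied as a plain list, and `[MASK]` is used as the placeholder token. None of these details are given in the abstract, and `mask_topic_words` is a hypothetical helper.

```python
import re


def mask_topic_words(sentence, topic_words, mask_token="[MASK]"):
    """Replace every topic word in the sentence with a mask token.

    Matching is case-insensitive; punctuation is split into its own tokens.
    """
    topics = {w.lower() for w in topic_words}
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(mask_token if tok.lower() in topics else tok for tok in tokens)


original = "Climate change is a hoax!"
augmented = mask_topic_words(original, ["climate", "change"])
print(augmented)
```

The original and masked sentences would then form a positive pair for the contrastive module, so that the learned features depend on how a stance is expressed rather than on the topic itself.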


Identifying Undiagnosable Cancers Using Machine Learning

#artificialintelligence

The first step in choosing the appropriate treatment for a cancer patient is to identify their specific type of cancer, including determining the primary site -- the organ or part of the body where the cancer begins. In rare cases, the origin of a cancer cannot be determined, even with extensive testing. Although these cancers of unknown primary tend to be aggressive, oncologists must treat them with non-targeted therapies, which frequently have harsh toxicities and result in low rates of survival. A new deep-learning approach developed by researchers at the Koch Institute for Integrative Cancer Research at MIT and Massachusetts General Hospital (MGH) may help classify cancers of unknown primary by taking a closer look at the gene expression programs related to early cell development and differentiation. "Sometimes you can apply all the tools that pathologists have to offer, and you are still left without an answer," says Salil Garg, a Charles W. (1955) and Jennifer C. Johnson Clinical Investigator at the Koch Institute and a pathologist at MGH. "Machine learning tools like this one could empower oncologists to choose more effective treatments and give more guidance to their patients."


Using machine learning to identify undiagnosable cancers

#artificialintelligence

The first step in choosing the appropriate treatment for a cancer patient is to identify their specific type of cancer, including determining the primary site -- the organ or part of the body where the cancer begins. In rare cases, the origin of a cancer cannot be determined, even with extensive testing. Although these cancers of unknown primary tend to be aggressive, oncologists must treat them with non-targeted therapies, which frequently have harsh toxicities and result in low rates of survival. A new deep-learning approach developed by researchers at the Koch Institute for Integrative Cancer Research at MIT and Massachusetts General Hospital (MGH) may help classify cancers of unknown primary by taking a closer look at the gene expression programs related to early cell development and differentiation. "Sometimes you can apply all the tools that pathologists have to offer, and you are still left without an answer," says Salil Garg, a Charles W. (1955) and Jennifer C. Johnson Clinical Investigator at the Koch Institute and a pathologist at MGH. "Machine learning tools like this one could empower oncologists to choose more effective treatments and give more guidance to their patients."


Machine learning uncovers 'genes of importance' in agriculture

#artificialintelligence

Machine learning can pinpoint "genes of importance" that help crops to grow with less fertilizer, according to a new study published in Nature Communications. It can also predict additional traits in plants and disease outcomes in animals, illustrating its applications beyond agriculture. Using genomic data to predict outcomes in agriculture and medicine is both a promise and a challenge for systems biology. Researchers have been working to determine how best to use the vast amount of genomic data available to predict how organisms respond to changes in nutrition, toxins and pathogen exposure--which in turn would inform crop improvement, disease prognosis, epidemiology and public health. However, accurately predicting such complex outcomes in agriculture and medicine from genome-scale information remains a significant challenge.